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Abstract. Data on molecular interactions is increasing at a tremendous 
pace, while the development of solid methods for analyzing this network 
data is lagging behind. This holds in particular for the field of compara- 
tive network analysis, where one wants to identify commonalities between 
biological networks. Since biological functionality primarily operates at 
the network level, there is a clear need for topology-aware comparison 
methods. In this paper we present a method for global network align- 
ment that is fast and robust, and can flexibly deal with various scoring 
schemes taking both node-to-node correspondences as well as network 
topologies into account. It is based on an integer linear programming 
formulation, generalizing the well-studied quadratic assignment problem. 
We obtain strong upper and lower bounds for the problem by improv- 
ing a Lagrangian relaxation approach and introduce the software tool 
NATALIE 2.0, a publicly available implementation of our method. In an 
extensive computational study on protein interaction networks for six 
different species, we find that our new method outperforms alternative 
state-of-the-art methods with respect to quality and running time. 



1 Introduction 

In the last decade, data on molecular interactions has increased at a tremendous 
pace. For instance, the STRING database [24], which contains protein protein 
interaction (PPI) data, grew from 261,033 proteins in 89 organisms in 2003 to 
5,214,234 proteins in 1,133 organisms in May 2011, more than doubling the num- 
ber of proteins in the database every two years. The same trends can be observed 
for other types of biological networks, including metabolic, gene-regulatory, sig- 
nal transduction and metagenomic networks, where the latter can incorporate 
the excretion and uptake of organic compounds through, for example, a micro- 



bial community 12 21 . In addition to the plethora of experimentally derived 
network data for many species, also the structure and behavior of molecular 
networks have become intensively studied over the last few years |2], leading to 
the observation of many conserved features at the network level. However, the 
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development of solid methods for analyzing network data is lagging behind, par- 
ticularly in the field of comparative network analysis. Here, one wants to identify 
commonalities between biological networks from different strains or species, or 
derived form different conditions. Based on the assumption that evolutionary 
conservation implies functional significance, comparative approaches may help 
(i) improve the accuracy of data, (ii) generate, investigate, and validate hypothe- 
ses, and (iii) transfer functional annotations. Until recently, the most common 
way of comparing two networks has been to solely consider node-to-node cor- 
respondences, for example by finding homologous relationships between nodes 
(e.g. proteins in PPI networks) of either network, while the topology of the two 
networks has not been taken into account. Since biological functionality pri- 
marily operates at the network level, there is a clear need for topology-aware 
comparison methods. In this paper we present a network alignment method that 
is fast and robust, and can flexibly deal with various scoring schemes taking both 
node-to-node correspondences as well as network topologies into account. 



Previous work. Network alignment establishes node correspondences based 
on both node-to-node similarities and conserved topological information. Sim- 
ilar to sequence alignment, local network alignment aims at identifying one or 
more shared subnetworks, whereas global network alignment addresses the over- 
all comparison of the complete input networks. 

Over the last years a number of methods have been proposed for both 
global and local network alignment, for example PathBlast fl4], NETWORK- 
Blast 



[22[ MaWISh 

SubMAP |4 . PathBlast heuristically computes high-scoring similar paths in 
two PPI networks. Detecting protein complexes has been addressed with Net- 
workBlast by Sharan et al. 22 , where the authors introduce a probabilistic 
model and propose a heuristic greedy approach to search for shared complexes. 
Koyutiirk et al. 



16] , Graemlin [8], IsoRank [23], Graal [l7j, and 



1G 



use a more elaborate scoring scheme based on an evolu- 
tionary model to compute local pairwise alignments of PPI networks. The Iso- 
Rank algorithm by Singh et al. [23] approaches the global alignment problem by 
preferably matching nodes which have a similar neighborhood, which is elegantly 
solved as an eigenvalue problem. Kuchaiev et al. 17 take a similar approach. 



Their method Graal matches nodes that share a similar distribution of so-called 
graphlets, which are small connected non-isomorphic induced subgraphs. 

In this paper we focus on pairwise global network alignment, where an align- 
ment is scored by summing up individual scores of aligned node and interaction 
pairs. Among the above mentioned methods, IsoRank and Graal use a scoring 
model that can be expressed in this manner. 



Contribution. We present an algorithm for global network alignment based on 
an integer linear programming (ILP) formulation, generalizing the well-studied 
quadratic assignment problem (QAP). We improve upon an existing Lagrangian 
relaxation approach presented in previous work 15 to obtain strong upper and 
lower bounds for the problem. We exploit the closeness to QAP and generalize 



Sparse Global Network Alignment 3 

1, if vi = A and v 2 = a, 

1, if «i = B and v 2 = b, 

1 , if Vi = C and v 2 = c, 

1 , if «i = D and v 2 = d, 

0, otherwise. 

10, if (-Ui/uJi) £ E 1 

and (« 2 ,'(«2) 6 £ 2 , 
0, otherwise. 

Fig. 1: Example of a network alignment. With the given scoring function, the 
alignment has a score of 4 + 40 = 44. 

a dual descent method for updating the Lagrangian multipliers to the general- 
ized problem. We have implemented the revised algorithm from scratch as the 
software tool NATALIE 2.0. In an extensive computational study on protein in- 
teraction networks for six different species, we compare NATALIE 2.0 to Graal 
and IsoRank, evaluating the number of conserved edges as well as functional 
coherence of the modules in terms of GO annotation. We find that NATALIE 2.0 
outperforms the alternative methods with respect to quality and running time. 
Our software tool NATALIE 2.0 as well as all data sets used in this study are 
publicly available at j http : / / planet- lis a. net | 

2 Preliminaries 

Given two simple graphs G\ = (V\,E\) and Gi — (V2,^2), an alignment 
a : V\ — t V2 is a partial injective function from V\ to V 2 ■ As such we have that an 
alignment relates every node in V\ to at most one node in V2 and that conversely 
every node in V2 has at most one counterpart in V±. An alignment is assigned a 
real-valued score using an additive scoring function s defined as follows: 

s i a ) = ^2 c (v,a{v)) + ^2 w(v,a(v),w,a(w)) (1) 

where c : V\ X V2 — > R is the score of aligning a pair of nodes in V\ and V% 
respectively. On the other hand, w : V\ x V 2 x V x x V 2 — > R allows for scoring 
topological similarity. The problem of global pairwise network alignment (GNA) 
is to find the highest scoring alignment a*, i.e. a* = arg max s(a). Figure [l] shows 
an example. 

NP-hardness of GNA follows by a simple reduction from the decision prob- 
lem Clique, which asks whether there is a clique of cardinality at least k in 
a given simple graph G = (V,E) [l3| . The corresponding GNA instance con- 
cerns the alignment of the complete graph of k vertices Kk = (Vk,Ek) with G 
using the scoring function s(a) = \{(v,w) G Ek \ (a(v),a(w)) e E}\. Since an 
alignment is injective, there is a clique of cardinality at least k if and only if 
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the cost of the optimal alignment is (^)- The close relationship of GNA with the 
quadratic assignment problem is more easily observed when formulating GNA as 
a mathematical program. Throughout the remainder of the text we use dummy 
variables i, j € {1, . . . , \V\\} and k, I € {1, . . . , | | } to denotes nodes in V\ and 
V2, respectively. Let C be a \Vi\ x |T^| matrix such that = c(i,k) and let 
W be a (|Vi| x |T^|) x (\Vi\ x | V2 1 ) matrix whose entries Wikji correspond to 
interaction scores w(i, k,j, I). Now we can formulate GNA as 

(IQP) 

(2) 
(3) 
(4) 

where the decision variable Xik indicates whether the i-th node in V\ is aligned 
with the fc-th node in V%. The above formulation shares many similarities with 
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i,j k,l 


wikjlXikXjl 
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E x J i - 1 
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x ik G {0, 1} 




Vi, k 



Lawler's formulation 19 of the QAP. However, instead of finding an assignment 
we are interested in finding a matching, which is reflected in constraints ([2| 
and ([3]) being inequalities rather than equalities. As can be seen in we only 
consider the upper triangle of W rather than the entire matrix. An analogous 
way of looking at this, is to consider W to be symmetric. This is usually not the 
case for QAP instances. In addition, due to the fact that biological input graphs 
are typically sparse, we have that W is sparse as well. These differences allow 
us to come up with an effective method of solving the problem as we will see in 
the following. 

3 Method 

The relaxation presented here follows the same lines as the one given by Adams 
and Johnson for the QAP [l] . We start by linearizing ( IQP I by introducing 



binary variables yikji defined as Uikji '■= XikXji and constraints yikji < Xji and 
Vikji < Xik for all i < j and k 7^ I. If we assume that all entries in W are positive, 
we do not need to enforce that yikji > Xik + Xji — 1. In Section [5] we will discuss 
this assumption. Rather than using the aforementioned constraints, we make use 
of a stronger set of constraints which we obtain by multiplying constraints ([2| 
and ^ by x ik : 



^Vikji = ^x^x^ < ^XikXji < x lk , Vi,j,k, i < j (5) 

l 1 1 

ijtk lytk 

^Vikjl ^^XikXjt <22 x ik x jl < x ik, Vi,k,l,k^l (6) 



j 3 
j>i j>% 
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We proceed by splitting the variable yikji (where i < j and k ^ I). In other words, 
we extend the objective function such that the counterpart of yikji becomes 
UjHk- This is accomplished by rewriting the dummy constraint in j6j to j 7^ i. 
In addition, we split the weights: Wikji = Wjuu = (w / ik j l /2) where w' ik ^ denotes 
the original weight. Furthermore, we require that the counterparts of the split 
decision variables assume the same value, which amounts to 



max CikXik + E w ikjiUikji 



i,j k.l 
i<j k^l 



•t. x i l - 1 

I 

E x i l - 1 

3 

yikji < %ik 

1 

l^k 

yikji < %ik 

3 

Vikjl Vjlik 

yikji e {0, 1} 
x lk e {0, 1} 



EE 

i,j kj 
i>j k=jtl 



Wikjiyikji (ILP) 

Vj (7) 

VI (8) 

Vi,i, fc, i (9) 

V*,*,l, k^l (10) 



Vi,j,k,l, i<j,kj<=l (11) 
Vi,j,k,l,i?j,k?l (12) 
Vi,fc (13) 



We can solve the continuous relaxation of ( ILP ) via its Lagrangian dual by 



dualizing the linking constraints (11) with multiplier A 



min Z LD (X) 



(LD) 



where Zld(A) equals 



max 2J Cife^ifc + E E^'fcj' + ^ikjl)Vikjl + EE^ ifej7 ~ ^3lik)Vik3l 



t.k 



i.j k.l 

i<j k^l 



i,j kj 
i>3 k^l 



s.t. 0, |8||, ([TO]), ((TJl and (B| 



Now that the linking constraints have been dualized, one can observe that the re- 
maining constraints decompose the variables into |Vi||V2| disjoint groups, where 
variables across groups are not linked by any constraint, and where each group 
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contains a variable Xi k and variables yucji for j ^ i and I ^ k. Hence, we have 
Zld(A) = max Y\c ik + v ik (\)]x ik (LD A ) 

s.t. Yl x ji ^ 1 w ( 14 ) 

i 

x fl ^ 1 yl ( 15 ) 

3 

x ik e{0,l} v*,fc (16) 

which corresponds to a maximum weight bipartite matching problem on the so- 
called alignment graph G m — (V1UV2, E m ). In the general case G m is a complete 
bipartite graph, i.e. E m — {(i,k) \ i G Vi,«2 € V2}. However, by exploiting 
biological knowledge one can make G m more sparse by excluding biologically- 
unlikely edges (see Section Eh. For the global problem, the weight of a matching 
edge (i, fc) is set to c ik + Mjfc(A), where the latter term is computed as 



Vik{X) = maxy^y^(n?j fej 7 + \kjl)Vikji + Y Y^ Wlk ^' 1 ~ ^Hk)Vikji (LD\ fc ) 
s.t. £ ift fcj -, < 1 Vj, j ^ i (17) 



X>«yi ^ 1 yi,i^k (is) 

yikjie {0,1} (19) 

Again, this is a maximum weight bipartite matching problem on the same align- 
ment graph but excluding edges incident to either i or k and using different edge 
weights: the weight of an edge (j, I) is w ik ji + Xikji if j > i, or w ik ,ji~\ jllk if j < i. 
So in order to compute Z LD (A), we need to solve a total number of ] Vi ] ] V2] + 1 
maximum weight bipartite matching problems, which, using the Hungarian al- 



gorithm 18 20 can be done in 0(n ) time, where n = max(|Vi|, |V^|). In case 
the alignment graph is sparse, i.e. 0(\E m \) = 0(n), Zld(A) can be computed 
in 0(n 4 log n) time using the successive shortest path variant of the Hungarian 
algorithm [7], It is important to note that for any A, Zld(A) is an upper bound 
on the score of an optimal alignment. This is because any alignment a is feasible 
to .Zld(A) and does not violate the original linking constraints and therefore 
has an objective value equal to s(a). In particular, the optimal alignment a* 
is also feasible to Zld(A) and hence a* < Zld(A). Since the two sets of prob- 
lems resulting from the decomposition both have the integrality property [6], 
the smallest upper bound we can achieve equals the linear programming (LP) 



bound of the continuous relaxation of ( ILP ) [9] . However, computing the small- 



est upper bound by finding suitable multipliers is much faster than solving the 
corresponding LP. Given solution (x,y) to Zld(A), we obtain a lower bound on 
s(a*), denoted Zib(A), by considering the score of the alignment encoded in x. 
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3.1 Solving Strategies 

In this section we will discuss strategies for identifying Lagrangian multipliers 
A that yield an as small as possible gap between the upper and lower bound 
resulting from the solution to Z\ jD {\). 



Subgradient optimization. We start by discussing subgradient optimization, 



which is originally due to Held and Karp 10 . The idea is to generate a sequence 



A , A , . . . of Lagrangian multiplier vectors starting from A = as follows: 

>8h = - ^1^5(4,) vij,k,i,i<j,k*i (20) 



where g{X\ k ji) corresponds to the subgradient of multiplier A* fe 7 , i.e. <?(Af fe ,-;) = 
Vikji — yjUk, and a is the step size parameter. Initially a is set to 1 and it is halved 
if neither Zld(A) nor Zib(A) have improved for over N consecutive iterations. 
Conversely, a is doubled if M times in a row there was an improvement in either 
2ld(A) or Zib(A) |5j. In case all subgradients are zero, the optimal solution 
has been found and the scheme terminates. Note that this is not guaranteed to 
happen. Therefore we abort the scheme after exceeding a time limit or a pre- 
specified number of iterations. In addition, we terminate if a has dropped below 
machine precision. Algorithm [T] gives the pseudo code of this procedure. 



Algorithm 1: SubgradientOpt(A, M, N) 

1 a <S— 1; n <S— N; m <S— M 

2 [LB*,UB*] <- [Z lh (X),Z w (X)] 

3 while g(\) / do 

4 A ^ A _ a (Z^(A)-Z, b (A)) g(At) 

5 if [LB* , UB*] \ [Zi b (A) , Z LD (A)] = then n <- n - 1 

6 else 

7 LB* <- max[LB*, Z lh (\)] 

8 UB* 4— min[UB*, Zld(A)] 

9 m •<— m — 1 

10 if n = then a <s— a/2; n ^— N 

11 if m = then a 2a; m <s— M 

12 return [LB*,UB*] 



Dual descent. In this section we derive a dual descent method which is an 
extension of the one presented in [l] . The dual descent method takes as a starting 
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point the dual of Zld (A) 
^ld(A) = 



mm 

a,/3 
S.t. 



E 

k 



OLi + fa > c ik + v ik (X) Vi,fc 
a, > Vi 
/3 fc > Vfc 



where the dual of Ujfc(A) is 



(A) = min 



i 



s.t. /4 + v\ > Wikji + Xi k ji 



> w 



ikjl 



X 



jlik 



> o 

> 



Vj,Z, j <i,lj^k 
VI. 



(21) 

(22) 
(23) 
(24) 



(25) 

(26) 
(27) 
(28) 
(29) 



Suppose that for a given A* we have computed dual variables (a, ft) solving 
(21) with objective value Zld(A'), as well as dual variables (/J, 1 ,v lk ) yielding 
values Wife (A) to linear programs ( |25[ ). The goal now is to find A* +1 such that 
the resulting bound is better or just as good, i.e. Z LV (X t+1 ) < Z LD (A'). We 
prevent the bound from increasing, by ensuring that the dual variables (a, /3) 
remain feasible to (21 ). This we can achieve by considering the slacks: 7^ (A) = 
di + /3fc — Cik — ^fe (A). So for (a, j3) to remain feasible, we can only allow every 
^ifc(A*) to increase by as much as 7Tjfc(A ). We can achieve such an increase by 
considering linear programs (|25[) and their slacks defined as 



Jikjl(X) 




Wtkji + X ikjh if j > i, 
Wtkji - Xjuk, if j < i, 



Vj,i,j¥>hi¥>k, (30) 



and update the multipliers in the following way. 

Lemma 1. The adjustment scheme below yields solutions to linear programs 
(25) with objective values i>ifc(A* +1 ) at most ^(A*) + Vifc(A') for all i,k. 

1fjlik{X) 



\t+l _ \t 
A ik]l — A ikjl 



■ <Pikjl 
<Pjlik 



■ nk 



l 

2(ni - 1) 



1 

2{n 2 - V 
1 



2(m 



1 



7r tt (A*) 

Ti»(A*) 



(31) 



for all j,l, i < j,k ^ I, where ri\ = \Vi\, n 2 = \V<i\ 
parameters. 



2{n 2 - 1) 
and < (pikji,Tji < 1 are 



Proof. We prove the lemma by showing that for any i, k there exists a feasible 
solution (/j, nk ,v nk ) to (25) whose objective value vn~(X t+1 ) is at most 7r,fe(A*) + 
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v i k(X t ). Let (// fc , v %k ) be the solution to (25) given multipliers A*. We claim that 
setting 



Hk 



2(m - l) 

7T tfc (A f ) 

2(n 2 - 1) 



Vj, j ^ i 
VI, I ^ k, 



results in a feasible solution to (25) given multipliers A* +1 . We start by showing 
that constraints (26) and (27) are satisfied. From (31) the following bounds on 
A t+1 follow. 



A 



A 



M kjl + 7«*i(A«) + (^T) + ^T) ) -..(A 4 ) Vj, l,j<i,l? k. 



t+i 

ikjl 



Therefore we have that the following inequalities imply constraints (26) and (27) 
for all j, I, j > i, I 7^ k: 



■3 " l ~ ™ ikjl + A ^"< + 7i ^ (A ' } + (2(7*1- 1) + 2(n 3 1 -l)) 
and for all j, I, j < i, I 7^ k 

Constraints (26) and (27) are indeed implied, as, for all j,l, j > i, I 7^ k, 

1 



fik 1 , Jik 



to 



Ak 1 , ,ik 



1 



7Tifc(A ) 



,2(m-i) 2(712-1), 

> to*,, + A^, + 7 *Wi(A*) + (^TJ + 2(^ZI)) 
and for all j, I, j < i, I ^= k 

1 1 



7Tifc(A ) 



2(71! -1) 2(712-1), 

> m kjl - Xjuk + 7ifeji(A*) + (2(^1) + 2(^1)) ^ (At) ' 

Since fij,vj k > (Vj, /, j ^ i,l ^ k) and by definition 7Tjfc(A*) > 0, constraints 
([28]) and (p9| are satisfied as well. The objective value of (n' lk , v nk ) is given by 



E A*f + E ^ fe = E + E v i k + ^( A ') = ^ A ') + ^( A *)- 



3 



I 



3 



I 



Since (25) are minimization problems and there exist, for all i, k, feasible solu- 
tions with objective values Ujfc(A*) + 71-^ (A*), we can conclude that the objective 
values of the solutions are bounded by this quantity. Hence, the lemma follows. 
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We use ip = 0.5, r = 1, and perform the dual descent method L successive 
times (see Algorithm [2]) . 



Algorithm 2: DualDescent(A, L) 

1 yx-0.5; [LB*,UB*] <- [Z lb (A), Z LD (A)] 

2 for n <— 1 to L do 

3 foreach i, k,j, I, i < j,k ^= I do 

Xikil <- A ifcJI +^( 7lkil + ^^ + ^%))- V (7 JItt + s ^j + 5 ^ y )) 

5 LB* <s— max[LB* , Zi b (A)] 

6 UB* «- Z LD (A) 

7 return [LB*,UB*] 



Overall method. Our overall method combines both the subgradient optimiza- 
tion and dual descent method. We do this performing the subgradient method 
until termination and then switching over to the dual descent method. This 
procedure is repeated K times (see Algorithm [3| . 



Algorithm 3: Natalie^, L, M, N) 

1 A <r- 0; [LB*, UB*] <- [0, oo] 

2 for k 1 to K do 

3 [LB*,UB*] <r- SubgradientOpt(A, M, N) n [LB*,UB*] 

4 [LB* , UB*] ^- DualDescent(A, L) n [LB* , UB*] 

5 return [LB*,UB*] 



We implemented NATALIE in C++ using the LEMON graph library (http:// 
|lemon.cs.elte.hu"7| ). The successive shortest path algorithm for maximum weight 
bipartite matching was implemented and contributed to LEMON. Special care 
was taken to deal with the inherent numerical instability of floating point num- 
bers. Our implementation supports both the GraphML and GML graph formats. 
Rather than using one big alignmen t graph , we store and use a different align- 
ment graph for every local problem ( LD'^' I . This proved to be a huge improve- 



ment in running times, especially when the global alignment graph is sparse. 
NATALIE is publicly available at |http://planet-lisa.net| 



4 Experimental Evaluation 



From the STRING database v8.3 (24], we obtained PPI networks for the fol- 
lowing six species: C. elegans (eel), S. cerevisiae (see), D. melanog aster (dme), 
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R. norvegicus (rno), M. musculus (mmu) and H. sapiens (hsa). We only consid- 
ered interactions that were experimentally verified. Table[T]shows the sizes of the 
networks. We performed, using the BLOSUM62 matrix, an all-against-all global 
sequence alignment on the protein sequences of all (2) = 15 pairs of networks. 
We used affine gap penalties with a gap-open penalty of 2 and a gap-extension 
penalty of 10. The first experiment in Section [4 . 1 1 compares the raw performance 
of IsoRank, Graal and NATALIE in terms of objective value. In Section |4.2| 
we evaluate the biological relevance of the alignments produced by the three 
methods. All experiments were conducted on a compute cluster with 2.26 GHz 
processors with 24 GB of RAM. 



species 


nodes 


annotated 


interactions 


eel (c) 


5,948 


4,694 


23,496 


see (s) 


6,018 


5,703 


131,701 


dme (d) 


7,433 


6,006 


26,829 


rno (r) 


8,002 


6,786 


32,527 


mmu (m) 


9,109 


8,060 


38,414 


hsa (h) 


11,512 


9,328 


67,858 



Table 1: Characteristics of input networks considered in this study. The columns 
contain species identifier, number of nodes in the network, number of annotated 
nodes thereof, and number of interactions 



4.1 Edge-Correctness 

The objective function used for scoring alignments in Graal counts the number 
of mapped edges. Such an objective function is easily expressible in our frame- 
work using s(a) — \{(v,w) € E\ \ (a(v),a(w)) £ £2}! and can also be modeled 
using the IsoRank scoring function. In order to compare performance of the 
methods across instances, we normalize the scores by dividing by min(|_Bi|, l-E^D- 



This measure is called the edge-correctness by Kuchaiev et al. 17 . 

As mentioned in Section [3j our method benefits greatly from using a sparse 
alignment graph. To that end, we use the e-values obtained from the all-against- 
all sequence alignment to prohibit biologically unlikely matchings by only con- 
sidering protein-pairs whose e- value is at most 100. Note that this only applies to 
NATALIE as both Graal and IsoRank consider the complete alignment graph. 
On each of the 15 instances, we ran Graal with 3 different random seeds and 
sampled the input parameter which balances the contribution of the graphlets 
with the node degrees uniformly within the allowed range of [0, 1]. As for ISO- 
Rank, when setting the parameter a — which controls to what extent topological 
similarity plays a role — to the desired value of 1, very poor results were ob- 
tained. Therefore we also sampled this parameter within its allowed range and 
re-evaluated the resulting alignments in terms of edge-correctness. NATALIE was 
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run with a time limit of 10 minutes and K = 3, L = 100, M = 10, N = 20. For 
both Graal and IsoRank only the highest-scoring results were considered. 

Figure [2] shows the results. IsoRank was only able to compute alignments 
for three out of the 15 instances. On the other instances IsoRank crashed, which 
may be due to the large size of the input networks. For Graal no alignments 
concerning see could be computed, which is due to the large number of edges in 
the network on which the graphlet enumeration procedure choked: in 12 hours 
only for 3% of the nodes the graphlet degree vector was computed. As for the 
last three instances, Graal crashed due to exceeding the memory limit inherent 
to 32-bit processes. Unfortunately no 64-bit executable was available. On the 
instances for which Graal could compute alignments, the performance — both 
in solution quality and running time — is very poor when compared to IsoRank 
and NATALIE. NATALIE outperforms IsoRank in both running time and solution 
quality. 



4.2 GO Similarity 

In order to measure the biological relevance of the obtained network alignments, 
we make use of the Gene Ontology (GO) |]. For every node in each of the six 
networks we obtained a set of GO annotations (see Table [I] for the exact num- 
bers). Each annotation set was extended to a multiset by including all ancestral 
GO terms for every annotation in the original set. Subsequently we employed 
a similarity measure that compares a pair of aligned nodes based on their GO 
annotations and also takes into account the relative frequency of each annota- 
Since the similarity measure assigns a score between and 1 to every 
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tion 

aligned node pair, the highest similarity score one can get for any alignment is the 
minimum number of annotated nodes in either of the networks. Therefore we can 
normalize the similarity scores by this quantity. Unlike the previous experiment, 
this time we considered the bitscores of the pairwise global sequence alignments. 
Similarly to IsoRank parameter a, we introduced a parameter j3 £ [0, 1] such 
that the sequence part of the score has weight (1 — /3) and the topology part has 
weight j3. For both IsoRank and NATALIE we sampled the weight parameters 
uniformly in the range [0, 1] and showed the best result in Figure[3j There we can 
see that both NATALIE and IsoRank identify functionally coherent alignments. 



5 Conclusion 

Inspired by results for the closely related quadratic assignment problem (QAP), 
we have presented new algorithmic ideas in order to make a Lagrangian relax- 
ation approach for global network alignment practically useful and competitive. 
In particular, we have generalized a dual descent method for the QAP. We have 
found that combining this scheme with the traditional subgradient optimization 
method leads to fastest progress of upper and lower bounds. 

Our implementation of the new method, NATALIE 2.0, works very well and 
fast when aligning biological networks, which we have shown in an extensive 
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Fig. 2: Performance of the three different methods for the all-against-all species 
comparisons (15 alignment instances). Missing bars correspond to exceeded 
time/memory limits or software crashes. 
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Fig. 3: Biological relevance of the alignments measured via GO similarity 



study on the alignment of cross-species PPI networks. We have compared NA- 
TALIE 2.0 to those state-of-the-art methods whose scoring schemes can be ex- 
pressed as special cases of the scoring scheme we propose. Currently, these meth- 
ods are IsoRank and GRAAL. Our experiments show that the Lagrangian re- 
laxation approach is a very powerful method and that it currently outperforms 
the competitors in terms of quality of the results and running time. 

Currently, all methods, including ours, approach the global network align- 
ment problem heuristically, that is, the computed alignments are not guaran- 
teed to be optimal solutions of the problem. While the other approaches are 
intrinsically heuristic — both IsoRank and Graal, for instance, approximate 
the neighborhood of a node and then match it with a similar node — the inex- 
actness in our methods has two causes that we plan to address in future work: 
On the one hand, there may still be a gap between upper and lower bound of 
the Lagrangian relaxation approach after the last iteration. We can use these 
bounds, however, in a branch-and-bound approach that will compute provably 
optimal solutions. On the other hand, we currently do not consider the complete 
bipartite alignment graph and may therefore miss the optimal alignment. Here, 



we will investigate preprocessing strategies, in the spirit of 25 , to safely sparsify 



the input bipartite graph without violating optimality conditions. 



The independence of the local problems ( LD^ fe ) allows for easy paralleliza- 



tion, which, when exploited would lead to an even faster method. Another im- 
provement in running times might be achieved when considering more involved 
heuristics for computing the lower bound, such as local search. More functionally- 
coherent alignments can be obtained when considering a scoring function where 
node-to-node correspondences are not only scored via sequence similarity but 
also for instance via GO similarity. In certain cases, even negative weights for 
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topological interactions might be desired in which case one needs to reconsider 
the assumption of entries of matrix W being positive. 
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